fix(hot): split history shards by encoded size, not count by dylanlott · Pull Request #66 · init4tech/storage

dylanlott · 2026-05-20T06:09:40Z

Closes ENG-2287

AccountsHistory and StorageHistory are stored DUPSORT, so each dup
value (key2 || encoded BlockNumberList) is capped at MDBX's DUPSORT
value limit (~1980 B on 4 KB pages). The previous splitter gated on
ShardedKey::SHARD_COUNT (2000), but a roaring BlockNumberList of 2000
sparse indices serialises to >20 KB. Once a hot account's shard tipped
over, every live block touching it crashed the sidecar with
MDBX_BAD_VALSIZE.

Replace count-based splitting in append_to_sharded_history with a
size-based check against a new ShardedKey::MAX_SHARD_BYTES (1500). When
the merged list overflows the budget, binary-search the largest prefix
that fits and emit shards until the remainder is consumed. Existing
oversized shards self-heal on the next write to them, so no migration
is required.

Add test_history_shard_fits_in_dupsort_limit to the hot conformance
suite. It exercises the real update_history_indices_inconsistent path
with 2000 sparse blocks (one per roaring container, the worst-case
encoding) and asserts the union across shards round-trips.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

dylanlott · 2026-05-20T06:09:52Z

fix(hot): split history shards by encoded size, not count #66 👈 (View in Graphite)
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

Closes ENG-2287 `AccountsHistory` and `StorageHistory` are stored DUPSORT, so each dup value (key2 || encoded BlockNumberList) is capped at MDBX's DUPSORT value limit (~1980 B on 4 KB pages). The previous splitter gated on ShardedKey::SHARD_COUNT (2000), but a roaring BlockNumberList of 2000 sparse indices serialises to >20 KB. Once a hot account's shard tipped over, every live block touching it crashed the sidecar with MDBX_BAD_VALSIZE. Replace count-based splitting in append_to_sharded_history with a size-based check against a new ShardedKey::MAX_SHARD_BYTES (1500). When the merged list overflows the budget, binary-search the largest prefix that fits and emit shards until the remainder is consumed. Existing oversized shards self-heal on the next write to them, so no migration is required. Add test_history_shard_fits_in_dupsort_limit to the hot conformance suite. It exercises the real update_history_indices_inconsistent path with 2000 sparse blocks (one per roaring container, the worst-case encoding) and asserts the union across shards round-trips. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

prestwich · 2026-05-20T07:41:29Z

+    /// Soft cap on the number of indices in one shard.
+    ///
+    /// This is a sanity ceiling used alongside [`Self::MAX_SHARD_BYTES`];
+    /// shard splitting is driven by encoded size, not by this count.


Another thing we should consider is whether we can empirically find size bounds with best and worst case inputs

prestwich

there are writing 3 cases:

exceeds MAX_SHARD_BYTES
exceeds MAX_SHARD_BYTES and the MDBX dupsort max
(boring case) does not exceed MAX_SHARD_BYTES

previously:

case 2 would panic during writing
case 1 & 3 would work

there are no changes to the decoding, so anything written with previous versions can still decode. Different versions would chunk history differently, and therefore have slightly different real-world perf, but there should not be any version compatibility issues, as the history chunking is not consensus critical

How does the treemap encode the indicies?
Can we empirically determine a safe number of indices based on the encoding so that we don't have to do a binary search?
In pursuit of 2, let's add tests to the storage-types package with worst-case indices.
Let's promote the SHARD_COUNT to a type-associated constant on our BlockNumberList

prestwich · 2026-05-21T10:53:18Z

[Claude Code]

Posting on James's behalf — follow-up to the size-based-splitting discussion.

Agree the size-based approach is the right call (the count-based alternative imposes ~10× metadata overhead on hot, dense addresses, which dominate real workloads). The objection is to the binary search, not the size budget.

Why the current splitter is more expensive than it needs to be

max_prefix_fitting does up to log₂(n) probes per shard, and each probe calls BlockNumberList::new_pre_sorted(&indices[..mid]) + serialized_size() — i.e., constructs a fresh RoaringTreemap from scratch per iteration. For a 2000-entry oversized merge that splits into ~13 shards, that's ~140 full roaring constructions on the overflow path. Not catastrophic (overflow is rare), but unnecessary.

Also, the if mid == 1 { break; } branch silently returns best = 1 (the initial value, never probed) without verifying that a 1-element list actually fits. Theoretical footgun.

Three cheaper heuristics

A. Incremental insert + exact size check (simplest). Insert indices one at a time into a working BlockNumberList; query serialized_size() after each. It's O(num_containers) per call, not O(n), so total work is O(n × containers) — for the dense path (1 container) effectively O(n); for the pathological all-distinct-containers path O(n²) ≈ 4M tiny ops, still cheaper than the current binary search's repeated full re-encodings. No probing, exact.

B. Boundary-triggered estimate (O(n)). serialized_size() only jumps when a new container is created or a container transitions representation (array↔bitmap↔run). Both are detectable from the input itself: (idx >> 16) changing (new container) or input crossing density thresholds. Maintain a running upper-bound estimate (treat every container as array: +12 B per new container, +2 B per element); emit when it crosses budget. Conservative — gives up some bitmap/run compression slack on dense paths — so periodically recalibrate with a real serialized_size() to recover it.

C. Segment-based greedy pack (O(n), exact). Pre-segment the merged input by (hi32, hi16) in one pass; each segment's encoded cost is predictable from cardinality. Greedy-pack segments into shards.

Recommendation: Option A

Smallest delta from the current PR. Replace max_prefix_fitting with a single-pass incremental builder:

let mut shard_start = 0;
let mut shard = BlockNumberList::default();
for (i, &idx) in all.iter().enumerate() {
    shard.push(idx)?;
    if shard.serialized_size() > ShardedKey::MAX_SHARD_BYTES {
        // emit prefix [shard_start..i) — rebuild once at the boundary
        let prefix = BlockNumberList::new_pre_sorted(all[shard_start..i].iter().copied());
        write_shard(all[i - 1], &prefix)?;
        shard_start = i;
        shard = BlockNumberList::default();
        shard.push(idx)?;
    }
}
write_shard(u64::MAX, &shard)?;

One roaring rebuild per shard boundary instead of log₂(n) per shard. Exact, no probing, simpler to read.

Orthogonal asks from your earlier review still stand

The four points from the change-request — characterise roaring encoding, add worst-case tests in storage-types, derive a documented safe upper bound, and move encoding-derived constants onto BlockNumberList — are independent of which splitting heuristic we use, and would land cleanly on top of Option A. The worst-case test in signet-storage-types is what would justify the MAX_SHARD_BYTES = 1500 number rigorously (right now it's asserted by comment, not by test).

prestwich · 2026-05-21T11:11:17Z

[Claude Code]

Follow-up posted on James's behalf — layering observation, separable from the splitting-heuristic discussion above.

MAX_SHARD_BYTES = 1500 currently lives in signet-storage-types on ShardedKey, but that number is a property of MDBX's DUPSORT value cap (~1980 B on 4 KB pages, minus key2 and per-node overhead). Other backends that implement HotKv — MemKv today, hypothetical SQL/remote backends later — have no such limit, and forcing them to split into ~1500-byte shards just wastes work and inflates their on-disk/in-memory layout for no reason.

The dual-keyed table abstraction itself is fine; that's a deliberate primitive in this storage layer, not an MDBX-ism. The issue is purely that the byte budget is backend-specific while the splitting policy is generic.

Minimal cleanup that fixes the layering without touching the trait surface:

trait HotKvWrite {
    /// Maximum encoded size of a single stored value, in bytes.
    /// `None` means unbounded.
    fn max_value_bytes(&self) -> Option<usize> { None }
}

signet-hot-mdbx overrides to return Some(1500) (or computes from page size).
MemKv keeps the default None and writes one monolithic shard per address.
append_to_sharded_history reads self.max_value_bytes(); if None, fast-path emit; otherwise run the size-based splitter against that budget.

This keeps the shared splitter (so we don't duplicate the algorithm per backend), keeps the schema shared (dual-keyed history tables remain the abstraction's primitive), and removes the MDBX-specific constant from signet-storage-types.

BlockNumberList::serialized_size() stays in signet-storage-types — it's a property of the encoding. Only the budget moves.

Happy to file this as its own issue and let ENG-2287 land first, or fold it into the same PR if it's small enough. Probably the former — different scope, easier to review separately.

dylanlott force-pushed the dylan/split-shards branch from cf04073 to c512dff Compare May 20, 2026 06:11

dylanlott requested a review from prestwich May 20, 2026 06:20

dylanlott marked this pull request as ready for review May 20, 2026 06:21

prestwich reviewed May 20, 2026

View reviewed changes

prestwich requested changes May 20, 2026

View reviewed changes

rswanson mentioned this pull request May 21, 2026

fix(hot): incremental splitter; promote constants to BlockNumberList #67

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(hot): split history shards by encoded size, not count#66

fix(hot): split history shards by encoded size, not count#66
dylanlott wants to merge 1 commit into
mainfrom
dylan/split-shards

dylanlott commented May 20, 2026 •

edited

Loading

Uh oh!

dylanlott commented May 20, 2026

Uh oh!

prestwich May 20, 2026

Uh oh!

prestwich left a comment

Uh oh!

prestwich commented May 21, 2026

Uh oh!

prestwich commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dylanlott commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dylanlott commented May 20, 2026

Uh oh!

prestwich May 20, 2026

Choose a reason for hiding this comment

Uh oh!

prestwich left a comment

Choose a reason for hiding this comment

Uh oh!

prestwich commented May 21, 2026

Why the current splitter is more expensive than it needs to be

Three cheaper heuristics

Recommendation: Option A

Orthogonal asks from your earlier review still stand

Uh oh!

prestwich commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dylanlott commented May 20, 2026 •

edited

Loading